Chapter · Scaling & Performance

Mastering Scaling & Performance

Performance engineering is the discipline of understanding how systems behave under load, where bottlenecks hide, and how to think clearly when everything seems to be falling apart. This chapter builds intuition layer by layer — from the raw definition of "fast" through latency measurement, database tuning, caching strategy, and the two fundamental modes of scaling. Everything is grounded in practical backend engineering.

What Do We Mean by "Fast"?

When a user clicks a button on your web app, a cascade of events fires: the browser packages an HTTP request, that request traverses the internet to your server, the server processes it (possibly hitting a database, calling an external API, or sending an email), and finally the server assembles a response — typically JSON — which travels back across the internet. The browser then parses the response and renders the result on screen.

The total elapsed time between the click and the final render is what the user perceives as speed. When someone says "your app is slow," they're describing this end-to-end duration — even if they've never heard the word latency.

Fig 1 · Request lifecycle — every millisecond adds up

Performance is not a vague, subjective quality. It can — and must — be expressed in numbers. The three fundamental metrics are latency (how long one request takes), throughput (how many requests per unit time), and utilization (what percentage of capacity is in use). Together, they form the vocabulary of performance engineering.

Latency Deep Dive

Latency is the time elapsed from the moment a request is sent to the moment a response is received. It is the single most user-visible performance metric — the number your customers feel in their fingers.

Latency is not a single number

A common mistake is to treat latency as one fixed value like "our API responds in 200ms." In reality, latency varies from request to request because of many real-world factors:

Cache hits vs. misses: a request served from a CDN or an in-memory cache like Redis may return in 5ms, while the same request hitting the database cold takes 200ms.
Server load: when your server is idle, it picks up each request immediately; when it's processing 50 concurrent requests, the new arrival waits in a queue.
Network conditions: packet loss, route changes, and geographic distance all introduce variability.
Request complexity: fetching a user profile is faster than generating a monthly analytics report.

Why averages lie

Imagine 1,000 requests: 990 complete in 50ms and 10 take 5,000ms. The average is ~100ms — which sounds healthy. But those 10 outlier requests represent real users staring at a loading spinner for 5 seconds. At 1 million requests/day, that's 10,000 terrible user experiences per day — completely invisible to the average.

Because averages absorb variation, they are nearly useless for performance analysis. What we need instead is a way to talk about the distribution of latencies — which brings us to percentiles.
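To make the arithmetic concrete, here is a small sketch that rebuilds the 1,000-request example using only Python's standard library. The 1-second threshold for a "slow" request is an assumption chosen for illustration, not a figure from this chapter.

```python
from statistics import mean, median

# The example above: 990 requests at 50 ms and 10 requests at 5,000 ms.
latencies_ms = [50] * 990 + [5000] * 10

SLOW_THRESHOLD_MS = 1000  # assumption: what counts as "slow" here

average = mean(latencies_ms)    # dragged upward by the 10 outliers
typical = median(latencies_ms)  # what the middle request actually sees
slow = sum(1 for ms in latencies_ms if ms >= SLOW_THRESHOLD_MS)

# Scale the same 1% slow share to one million requests per day.
slow_per_day = slow * 1_000_000 // len(latencies_ms)

print(f"mean          = {average} ms")
print(f"median        = {typical} ms")
print(f"slow requests = {slow} of {len(latencies_ms)}")
print(f"slow per day  = {slow_per_day} at 1M requests/day")
```

The mean lands near 100ms even though no request took anywhere close to 100ms: almost every request was twice as fast, and a few were fifty times slower.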

Percentiles: P50, P90, P99

A percentile tells you the latency threshold below which a certain percentage of requests fall. The three percentiles you'll encounter constantly are:

Percentile	Meaning	Example
P50 (median)	50% of requests are faster than this value	P50 = 80ms → half your users see ≤80ms
P90	90% faster; 10% of users experience this or worse	P90 = 400ms → 1 in 10 users waits 400ms+
P99	99% faster; 1% of users experience this or worse	P99 = 2s → 1 in 100 users waits 2 seconds+

Fig 2 · Latency distribution with percentile markers

Why P99 and P95 matter most

Backend engineers focus on tail latencies (P95/P99) for two reasons. First, the requests that land in the tail are typically the most complex — they involve the heaviest business logic, the most joins, or the most service-to-service calls. Second, these requests often represent your most valuable users: someone making a purchase, generating a report, or completing a checkout flow — exactly the workflows where a 5-second hang loses revenue and trust.

Measuring percentiles in Go

The Go standard library doesn't ship a percentile calculator, but you can compute them from sorted slices or use a library. Here's a simple implementation from scratch:

Go
package main

import (
    "fmt"
    "math"
    "sort"
    "time"
)

// Percentile returns the p-th percentile from a slice of durations.
// p must be between 0 and 100.
func Percentile(latencies []time.Duration, p float64) time.Duration {
    if len(latencies) == 0 {
        return 0
    }
    sorted := make([]time.Duration, len(latencies))
    copy(sorted, latencies)
    sort.Slice(sorted, func(i, j int) bool {
        return sorted[i] < sorted[j]
    })

    // Nearest-rank on a zero-based index, rounding up to the next sample.
    rank := (p / 100.0) * float64(len(sorted)-1)
    idx := int(math.Ceil(rank))
    return sorted[idx]
}

func main() {
    // Simulated latency samples
    samples := []time.Duration{
        12 * time.Millisecond, 15 * time.Millisecond,
        22 * time.Millisecond, 48 * time.Millisecond,
        95 * time.Millisecond, 110 * time.Millisecond,
        340 * time.Millisecond, 890 * time.Millisecond,
        1200 * time.Millisecond, 4800 * time.Millisecond,
    }

    fmt.Printf("P50:  %v\n", Percentile(samples, 50))
    fmt.Printf("P90:  %v\n", Percentile(samples, 90))
    fmt.Printf("P99:  %v\n", Percentile(samples, 99))
}

Measuring percentiles in Python

Python
import numpy as np

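# Simulated latency samples (in milliseconds).
# np.percentile interpolates linearly between samples by default, so its
# results can differ from a nearest-rank calculation on the same data.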
samples = [12, 15, 22, 48, 95, 110, 340, 890, 1200, 4800]

print(f"P50:  {np.percentile(samples, 50):.0f}ms")
print(f"P90:  {np.percentile(samples, 90):.0f}ms")
print(f"P99:  {np.percentile(samples, 99):.0f}ms")
print(f"P99.9:{np.percentile(samples, 99.9):.0f}ms")

📖 Reference: Prometheus — Histograms & Summaries — the industry standard for collecting percentile data in production.

Throughput

While latency describes the duration of a single request, throughput measures how many requests your system can handle in a given time window — typically expressed as requests per second (RPS) or requests per minute (RPM).

The two metrics are deeply intertwined but not in a simple, linear way. Your system might show excellent latency at 10 RPS — say 50ms per request. But at 5,000 RPS, latency might spike to 2 seconds. Understanding throughput alongside latency lets you answer critical operational questions: Can we survive the Black Friday traffic spike? How many concurrent users can we support before we need more resources? What happens if we get featured on a popular podcast tomorrow?

Throughput = Requests Handled / Time Period

The latency–throughput curve

As throughput increases, latency rises slowly at first and then dramatically: it follows a curve, not a line. At low throughput, the server is mostly idle and each request is served immediately. As throughput climbs toward capacity, requests begin queuing and contending for CPU, memory, and I/O. Near saturation, latency grows without bound.

Fig 3 · Latency rises steeply as throughput approaches capacity

This relationship is described by queuing theory, specifically the M/M/1 queue model from operations research. The growth is hyperbolic rather than strictly exponential: as utilization ρ approaches 1.0, the average time a request spends in the system (waiting plus service) tends toward infinity:

W = S / (1 − ρ) where S = service time, ρ = utilization
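Plugging numbers into the formula shows how quickly things degrade. The sketch below evaluates W for a range of utilizations; the 50ms service time is an illustrative assumption, and real systems only approximate the M/M/1 model.

```python
def mm1_response_time(service_time_ms: float, utilization: float) -> float:
    """Mean time in the system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time_ms / (1 - utilization)


SERVICE_TIME_MS = 50.0  # assumption: illustrative service time S

for rho in (0.10, 0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    w = mm1_response_time(SERVICE_TIME_MS, rho)
    print(f"utilization {rho:>4.0%} -> mean response time {w:8.1f} ms")
```

Going from 50% to 90% utilization multiplies the mean response time by five, and the last few percentage points before saturation cost far more than all the ones before them.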

📖 Reference: MDN — Web Performance — Mozilla's comprehensive guide to front-to-back performance measurement.

Utilization & Queuing

Utilization is the percentage of your system's total capacity that is actively in use. At 0% the system is idle; at 100% every resource is saturated and nothing new can be processed.

The ice cream shop analogy

Imagine an ice cream shop with one worker. On a quiet Sunday evening, you walk in, order, and get served instantly: low utilization, low latency. At Tuesday lunchtime, there's a long queue. The worker prepares each cone at exactly the same speed, but your wait time has skyrocketed because you must wait for everyone ahead of you. The worker's service time per cone hasn't changed. Your perceived latency has.

Backend systems behave identically: when CPU and memory utilization are low, requests are served instantly. As more requests arrive, each one waits for the requests ahead of it to complete — forming a queue. The higher the utilization, the longer the queue, and the longer each request waits.

The highway analogy

Think of a highway at different capacity levels. At 50%, traffic flows smoothly and cars maintain speed. At 80%, lane changes require more caution and speeds begin to drop. At 90%, the system becomes chaotic — sometimes flowing, sometimes jamming. Every brake tap creates a ripple effect. At 100%, total gridlock: no one moves.

The nonlinear relationship

The relationship between utilization and latency is not linear. We intuitively expect doubling utilization to double latency, but the reality is far worse. Latency stays relatively flat until roughly 70% utilization, then curves sharply upward, and near 100% goes nearly vertical — approaching infinity.

Key Takeaway

You can never run your systems at 100% utilization and expect them to perform. Production systems typically target 60–80% utilization, reserving 20–40% as headroom for traffic bursts. Traffic is bursty by nature — it arrives in waves, not at a steady rate — and your buffer is what absorbs those spikes.

Finding Bottlenecks

When your system is slow, something specific is causing the slowness. That something is called the bottleneck. The most critical discipline in performance engineering is identifying the actual bottleneck before applying a fix — not guessing.

The trap of jumping to solutions

When engineers encounter a slow API, the instinct is to reach for standard by-the-book remedies: add Redis caching, upgrade Postgres, throw more servers at it. Sometimes you get lucky. But more often, you spend days or weeks implementing a solution to a problem you don't actually have, while the real bottleneck sits untouched.

A real-world example

Consider a GET /products/:id endpoint that feels slow. The knee-jerk reaction: the database must be the problem. You spend a week building a Redis caching layer. You deploy. The API is still slow.

Now you add timing measurements throughout the code and discover: the database query takes 10ms, the new Redis lookup takes 5ms, but a synchronous logging call to a remote Elasticsearch service takes 500ms — blocking the request while it waits for a network response. The database was never the problem.

The golden rule

Never guess. Always measure. Common hidden bottlenecks include: synchronous logging to remote services, JSON/XML serialization of large payloads, external API calls in loops, large response payloads causing slow network transmission, and middleware that runs on every request.
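One lightweight way to "add timing measurements throughout the code" is a context manager that records how long each step takes. This is a minimal sketch: the `timed` helper, the step names, and the `time.sleep` calls standing in for real work are all hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step: str, timings: dict):
    """Record the wall-clock duration of the wrapped block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = (time.perf_counter() - start) * 1000


def time_get_product(product_id: str) -> dict:
    """Run each step of a pretend GET /products/:id handler and time it."""
    timings: dict = {}
    with timed("db_query", timings):
        time.sleep(0.002)  # stand-in for the database query
    with timed("cache_lookup", timings):
        time.sleep(0.001)  # stand-in for the Redis lookup
    with timed("remote_logging", timings):
        time.sleep(0.050)  # stand-in for the synchronous logging call
    return timings


timings = time_get_product("42")
for step, ms in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{step:<15} {ms:7.1f} ms")

slowest = max(timings, key=timings.get)
print(f"bottleneck: {slowest}")
```

Sorting the steps by duration puts the real bottleneck at the top of the list, which is usually enough to redirect the investigation.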

Timing middleware in Go

Here's a practical middleware that logs per-request timing, so you can start identifying slow endpoints:

Go
package middleware

import (
    "log"
    "net/http"
    "time"
)

// TimingMiddleware logs the duration of every HTTP request.
func TimingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        // Wrap the ResponseWriter to capture the status code
        rw := &statusWriter{ResponseWriter: w, status: 200}
        next.ServeHTTP(rw, r)

        duration := time.Since(start)
        log.Printf(
            "method=%s path=%s status=%d duration=%v",
            r.Method, r.URL.Path, rw.status, duration,
        )
    })
}

type statusWriter struct {
    http.ResponseWriter
    status int
}

func (w *statusWriter) WriteHeader(code int) {
    w.status = code
    w.ResponseWriter.WriteHeader(code)
}

Timing middleware in Python (FastAPI)

Python
import time
import logging
from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware

logger = logging.getLogger("perf")

class TimingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        duration_ms = (time.perf_counter() - start) * 1000

        logger.info(
            f"method={request.method} path={request.url.path} "
            f"status={response.status_code} duration={duration_ms:.1f}ms"
        )
        return response

Profiling & Flame Graphs

Profiling is the practice of measuring exactly where your application spends its time. A profiler attaches to your running application, samples the call stack at regular intervals, and records which functions are executing, how long they run, and how much CPU or memory they consume.

How a profiler works

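Sampling CPU profilers (like Go's built-in pprof, or py-spy for Python) periodically interrupt the program and capture a snapshot of the current call stack. After thousands of samples, a statistical picture emerges showing which functions dominate execution time. Python's built-in cProfile takes a different route: it is a deterministic profiler that records every function call, which gives exact call counts at the cost of more overhead.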

Flame graphs

Raw profiler output can list thousands of functions, each accounting for some fraction of total time. Flame graphs make this data visual: each function is a horizontal bar, wider bars indicate more time spent, and functions called by other functions are stacked vertically. One glance at a flame graph tells you where your application's hotspots are.

Fig 4 · Flame graph — wider bars = more time; the logging call dominates

CPU profiling in Go with pprof

Go
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // side-effect import: registers /debug/pprof/ handlers
)

func main() {
    // Your normal app setup...
    // pprof endpoints are now accessible at /debug/pprof/
    // Collect a 30-second CPU profile and open the pprof web UI (any free port works):
    //   go tool pprof -http=:8081 "http://localhost:6060/debug/pprof/profile?seconds=30"
    //   then choose View → Flame Graph in the browser
    // Passing nil uses http.DefaultServeMux, which is where pprof registers itself.
    log.Fatal(http.ListenAndServe(":6060", nil))
}

CPU profiling in Python

Python
import cProfile
import pstats

# Profile a function and print the 20 entries with the highest cumulative time
with cProfile.Profile() as pr:
    handle_request()  # your function under test

stats = pstats.Stats(pr)
stats.sort_stats("cumulative")
stats.print_stats(20)

# For flame graphs, use py-spy (external tool):
#   pip install py-spy
#   py-spy record -o profile.svg -- python app.py

CPU-bound vs I/O-bound bottlenecks

Profilers excel at finding CPU-bound bottlenecks — functions that consume excessive computation cycles (image processing, machine learning inference, heavy math). However, most backend applications are I/O-bound: the time is spent waiting for database responses, external API calls, file reads, or message queue acknowledgments. CPU profilers are less effective here because the CPU isn't working — it's waiting. For I/O-bound work, you need distributed tracing.

Distributed Tracing

Distributed tracing follows a single request as it flows through your entire system — across services, databases, caches, and external APIs — recording timestamps at every boundary. The output is a trace: a tree of "spans," each representing a timed operation.

Where a profiler shows what the CPU did, a trace shows where the clock ticked. For example, a trace of GET /products/5 might reveal: 2ms in your API logic, 800ms in the database query, and 50ms in serialization. Now you know exactly where to focus.
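The span tree is easier to picture with a toy version. The sketch below is not OpenTelemetry or any real tracing SDK: `ToyTracer` and `Span` are hypothetical names, it works only within a single thread and process, and the `time.sleep` calls stand in for real work. It shows just the core idea that nested timed blocks become parent and child spans.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    start: float = 0.0
    end: float = 0.0
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


class ToyTracer:
    """Keeps a stack of open spans so nested blocks become child spans."""

    def __init__(self) -> None:
        self.root: Optional[Span] = None
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        current = Span(name, start=time.perf_counter())
        if self._stack:
            self._stack[-1].children.append(current)  # nest under the open span
        else:
            self.root = current
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()


def render(span: Span, depth: int = 0) -> List[str]:
    """Format the span tree as indented lines, one per span."""
    lines = [f"{'  ' * depth}{span.name}: {span.duration_ms:.1f} ms"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = ToyTracer()
with tracer.span("GET /products/5"):
    with tracer.span("api_logic"):
        time.sleep(0.001)
    with tracer.span("db_query"):
        time.sleep(0.020)
    with tracer.span("serialization"):
        time.sleep(0.002)

print("\n".join(render(tracer.root)))
```

A real tracing system adds what this sketch leaves out: trace IDs propagated across network calls, so spans recorded by different services can be stitched into one tree.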

Fig 5 · Distributed trace — the DB query dominates this request's timeline

Popular tracing tools include Jaeger, Zipkin, and OpenTelemetry (the emerging standard that unifies tracing, metrics, and logging). OpenTelemetry provides SDKs for Go, Python, Java, and most major languages.

📖 Reference: OpenTelemetry Documentation — the vendor-neutral standard for distributed tracing, metrics, and logs.

The N+1 Query Problem

The N+1 query problem is one of the most notorious performance anti-patterns in backend engineering. It occurs when you make 1 query to fetch a list of N items, then N additional queries to fetch a related piece of data for each item — totaling N+1 database round-trips.

How it happens

Imagine a blog application. To render the homepage, you need 20 blog posts with each post's author name. The naive approach: query for all 20 posts, then loop through and make a separate query for each post's author. That's 21 database queries. For 1,000 posts, it's 1,001 queries. Even at 5ms per query, 1,001 queries take over 5 seconds.

Fig 6 · N+1 queries vs. bulk fetch — same data, vastly different performance

The fix: batch fetching

Instead of querying in a loop, collect all the IDs you need and fetch them in a single query using WHERE id IN (...). The result: always 2 queries regardless of the number of items — one for the list, one for the related data.

Go — solving N+1 with a batch query

Go
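// Assumes "context" and "database/sql" are imported, and that Post, Author,
// PostWithAuthor, queryPosts, and queryAuthorsByIDs are defined elsewhere.
func GetPostsWithAuthors(ctx context.Context, db *sql.DB) ([]PostWithAuthor, error) {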
    // Query 1: fetch all posts
    posts, err := queryPosts(ctx, db)
    if err != nil {
        return nil, err
    }

    // Collect unique author IDs
    authorIDSet := make(map[int64]struct{})
    for _, p := range posts {
        authorIDSet[p.AuthorID] = struct{}{}
    }
    authorIDs := make([]int64, 0, len(authorIDSet))
    for id := range authorIDSet {
        authorIDs = append(authorIDs, id)
    }

    // Query 2: fetch ALL authors in one batch
    authors, err := queryAuthorsByIDs(ctx, db, authorIDs)
    if err != nil {
        return nil, err
    }

    // Build lookup map and merge
    authorMap := make(map[int64]Author)
    for _, a := range authors {
        authorMap[a.ID] = a
    }

    result := make([]PostWithAuthor, len(posts))
    for i, p := range posts {
        result[i] = PostWithAuthor{Post: p, Author: authorMap[p.AuthorID]}
    }
    return result, nil
}

Python (Django) — select_related & prefetch_related

Python
# BAD — N+1 queries (Django ORM)
posts = Post.objects.all()
for post in posts:
    print(post.author.name)  # Each access fires a separate query!

# GOOD — 1 query with JOIN (ForeignKey)
posts = Post.objects.select_related("author").all()

# GOOD — 2 queries (ManyToMany or reverse FK)
posts = Post.objects.prefetch_related("tags").all()

Every modern ORM provides bulk-fetch primitives: Django has select_related / prefetch_related, Ruby on Rails has includes, Prisma has include, and Drizzle offers leftJoin and relational queries. If you're writing raw SQL, the answer is a JOIN or a single WHERE id IN (...) query. The pattern to avoid is any loop that fires a query per iteration.
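To see the round-trip counts side by side without an ORM, here is a self-contained sketch using Python's built-in sqlite3 module with an in-memory database. The schema, the `run` helper that counts queries, and the function names are assumptions made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                        author_id INTEGER NOT NULL REFERENCES authors(id));
""")
conn.executemany("INSERT INTO authors (id, name) VALUES (?, ?)",
                 [(i, f"author-{i}") for i in range(1, 6)])
conn.executemany("INSERT INTO posts (id, title, author_id) VALUES (?, ?, ?)",
                 [(i, f"post-{i}", (i % 5) + 1) for i in range(1, 21)])

query_count = 0


def run(sql, params=()):
    """Execute one query and count it as one database round-trip."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def posts_n_plus_one():
    posts = run("SELECT id, title, author_id FROM posts ORDER BY id")
    result = []
    for _id, title, author_id in posts:
        # One extra query per post: this is the N in N+1.
        (name,) = run("SELECT name FROM authors WHERE id = ?", (author_id,))[0]
        result.append((title, name))
    return result


def posts_batched():
    posts = run("SELECT id, title, author_id FROM posts ORDER BY id")
    author_ids = sorted({author_id for _, _, author_id in posts})
    # Assumes at least one post; an empty IN () list is not portable SQL.
    placeholders = ", ".join("?" for _ in author_ids)
    rows = run(f"SELECT id, name FROM authors WHERE id IN ({placeholders})", author_ids)
    names = dict(rows)
    return [(title, names[author_id]) for _, title, author_id in posts]


def posts_joined():
    return run("SELECT p.title, a.name FROM posts p "
               "JOIN authors a ON a.id = p.author_id ORDER BY p.id")


counts = {}
results = {}
for label, fn in [("n+1", posts_n_plus_one), ("batched", posts_batched), ("join", posts_joined)]:
    query_count = 0
    results[label] = fn()
    counts[label] = query_count
    print(f"{label:<8} {counts[label]:>2} queries")
```

All three functions return the same 20 rows; only the number of round-trips differs, and that number is what grows with the size of the list.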

Pro tip

Most modern ORMs have a "print generated SQL" mode. Enable it during development to catch N+1 patterns before they reach production. In Django: settings.LOGGING with the django.db.backends logger. In Go's GORM: db.Debug().

Indexes & Query Plans

The most common source of database performance problems is the absence of appropriate indexes. An index is a data structure — typically a B-tree — that maintains a sorted copy of the values in one or more columns, along with pointers back to the full rows. It transforms a slow sequential scan (examining every row) into a fast lookup.

The library analogy

Imagine a library with a million books but no catalog. Finding all books by John Green requires walking every shelf, examining every book — a full table scan. Now add a catalog sorted by author name with shelf locations. You look up "Green, John," find three shelf references, walk directly to those shelves, and return in minutes instead of days. That catalog is the index.

Sequential scan vs. index scan

Without an index, your database performs a sequential scan (also called a full table scan): it reads every row to check the WHERE clause. For a million rows, this can take several seconds. With a B-tree index, the database walks the tree in a logarithmic number of steps, so matching rows are typically found in a few milliseconds, and lookup time grows only slowly as the table grows.

Sequential scan: O(n) → Index scan: O(log n)

Creating indexes

SQL
-- Basic single-column index
CREATE INDEX idx_posts_author_id ON posts (author_id);

-- Composite index: order matters!
-- This helps queries filtering by user_id, or user_id + created_at
-- It does NOT help queries filtering ONLY by created_at
CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);

-- Covering index: all needed data lives IN the index
-- A query that selects only name and id can be answered by an index-only scan,
-- usually without touching the main table
CREATE INDEX idx_departments_name ON departments (name) INCLUDE (id);

Composite vs. Covering indexes

Composite Index

Indexes multiple columns together
Optimizes queries that filter on those columns
Column order matters — leftmost prefix rule
Index on (A, B) helps WHERE A=... and WHERE A=... AND B=... but not WHERE B=... alone

Covering Index

Includes all columns needed by a query
Database serves data entirely from the index
Eliminates the "heap lookup" (going back to the table)
Uses INCLUDE clause in Postgres to add non-key columns
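The leftmost prefix rule can be observed directly by asking a database for its query plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's built-in sqlite3 module, as a stand-in that runs anywhere; Postgres plan output looks different, and the table contents here are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        created_at TEXT NOT NULL,
        total_cents INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO orders (user_id, created_at, total_cents) VALUES (?, ?, ?)",
    [(i % 100, f"2024-01-{(i % 28) + 1:02d}", i * 10) for i in range(5000)],
)


def plan(sql: str) -> str:
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # the detail text is the last column


by_user = "SELECT * FROM orders WHERE user_id = 42"
by_user_and_date = "SELECT * FROM orders WHERE user_id = 42 AND created_at = '2024-01-15'"
by_date_only = "SELECT * FROM orders WHERE created_at = '2024-01-15'"

plan_before = plan(by_user)  # no index yet: full table scan

conn.execute("CREATE INDEX idx_orders_user_created ON orders (user_id, created_at)")

plan_user = plan(by_user)                    # leftmost column: index is usable
plan_user_date = plan(by_user_and_date)      # both columns: index is usable
plan_date_only = plan(by_date_only)          # second column alone: back to a scan

for label, text in [("before index", plan_before), ("user_id", plan_user),
                    ("user_id + created_at", plan_user_date),
                    ("created_at only", plan_date_only)]:
    print(f"{label:<22} {text}")
```

The exact wording of the plan varies between SQLite versions, but the pattern holds: queries that constrain user_id mention the index, while the query that filters only on created_at falls back to scanning the table.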

The cost of indexes

Indexes are not free. Each index consumes disk space (and ideally fits in memory). More critically, every INSERT, UPDATE, or DELETE must update all associated indexes. If you blindly index every column, your write operations slow dramatically. The guideline: index columns that appear in WHERE, JOIN, and ORDER BY clauses of your most frequent queries, and leave the rest alone.

EXPLAIN ANALYZE — reading the query plan

To discover whether your queries use indexes, Postgres provides the EXPLAIN ANALYZE command. It runs the query and shows the exact execution plan — which tables were scanned, which indexes were used, and how long each step took.

SQL
EXPLAIN ANALYZE
SELECT p.*, u.name AS author_name
FROM posts p
JOIN users u ON u.id = p.author_id
WHERE p.published = true
ORDER BY p.created_at DESC
LIMIT 20;

-- Look for "Seq Scan" in the output — that means no index is being used.
-- After adding an index, re-run and confirm it shows "Index Scan".

📖 Reference: PostgreSQL — Using EXPLAIN — official documentation on reading query execution plans.

Connection Pooling

Establishing a database connection is expensive. Each new connection involves a TCP three-way handshake, authentication, encryption negotiation, session state setup, and memory allocation on the database server (typically several MB per connection). If your application opens a new connection for every query and closes it immediately, you're paying this cost on every single request.

The two problems

Problem 1: Latency overhead. Each connection setup adds tens of milliseconds of overhead, which is wasted time before any actual query runs.
Problem 2: Connection exhaustion. Databases have hard limits on simultaneous connections (Postgres defaults to 100, and production servers are often configured for 300–500). During traffic spikes, your backend can hit that limit, at which point new connections are refused and requests start failing.

Connection pooling

A connection pool maintains a set of pre-established, reusable connections. Instead of creating a new connection for each query, your application borrows a connection from the pool, uses it, and returns it. The connection stays open for reuse by the next request. This removes the per-request setup cost and, because the pool has a fixed maximum size, caps how many connections each application instance can open.
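The borrow-and-return cycle fits in a few lines. The following is a simplified sketch, not how any particular driver implements pooling: `SimplePool` is a hypothetical class, and in-memory SQLite connections stand in for real database connections.

```python
import queue
import sqlite3
from contextlib import contextmanager


class SimplePool:
    """A fixed-size pool: connections are opened once and then reused."""

    def __init__(self, size: int, connect):
        self.opened = 0
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # pay the setup cost up front
            self.opened += 1

    @contextmanager
    def connection(self, timeout: float = 5.0):
        # Blocks while every connection is busy; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection instead of closing it


pool = SimplePool(size=2, connect=lambda: sqlite3.connect(":memory:"))

results = []
for _ in range(10):
    with pool.connection() as conn:
        results.append(conn.execute("SELECT 1").fetchone()[0])

print(f"ran {len(results)} queries over {pool.opened} connections")
```

Ten queries ran, but only two connections were ever opened. The fixed size is also what bounds the load on the database: a caller that finds the pool empty waits (or times out) rather than opening connection number three.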

Fig 7 · External connection pool (PgBouncer) sits between servers and database

Internal vs. external pooling

Internal Pool

Pool lives inside each server instance
Each instance maintains its own connections
Simple setup; most DB drivers include one
Risk: 3 instances × 150 connections = 450, which can exceed DB limit of 300

External Pool

Standalone process (PgBouncer, PgCat)
All server instances share one pool
Total connections to DB are centrally controlled
Preferred for horizontally scaled production systems

Connection pooling in Go

Go
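import (
    "database/sql"
    "log"
    "time"

    _ "github.com/lib/pq" // assumption: any driver that registers the "postgres" name works
)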

db, err := sql.Open("postgres", connStr)
if err != nil {
    log.Fatal(err)
}

// Configure the built-in connection pool
db.SetMaxOpenConns(25)                  // max open connections to the DB
db.SetMaxIdleConns(10)                  // max idle connections kept ready
db.SetConnMaxLifetime(5 * time.Minute)  // recycle connections after 5 min
db.SetConnMaxIdleTime(1 * time.Minute)  // close idle connections after 1 min

Connection pooling in Python (SQLAlchemy)

Python
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:pass@host/db",
    pool_size=20,          # number of persistent connections
    max_overflow=10,       # extra connections allowed during spikes
    pool_timeout=30,       # seconds to wait for a free connection
    pool_recycle=1800,     # recycle connections every 30 minutes
    pool_pre_ping=True,    # test connection health before use
)

📖 Reference: PgBouncer — lightweight connection pooler for PostgreSQL, the de facto standard for production Postgres deployments.

Caching Fundamentals

Caching is the practice of storing the result of an expensive operation in a faster storage layer so that subsequent requests for the same data can be served without repeating the expensive work. The idea is beautifully simple: compute once, serve many.

In practice, you place a cache (typically Redis, Memcached, or Valkey) in front of your database. When a request arrives, you first check the cache. On a cache hit, you return the cached data immediately — often in under 5ms. On a cache miss, you query the database, store the result in the cache, and then return it. A single move can reduce perceived latency from 800ms to 50ms.

Local vs. distributed caching

Local Cache

In-process memory (dictionary/map)
Fastest possible access: sub-millisecond, no network hop
Private to each server instance
Problem: cache inconsistency across instances

Distributed Cache

External service (Redis, Memcached, Valkey)
Shared across all server instances
No inconsistency between servers
Con: every lookup pays a network round-trip (typically a few milliseconds)

Tiered caching

Many production systems use both: a small local cache for the "hottest" data (most frequently accessed items) backed by a larger distributed cache. A request checks the local cache first (sub-millisecond), then falls back to the distributed cache (a few milliseconds), and only hits the database on a full miss.
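The lookup order can be sketched as follows. This is a toy model under stated assumptions: plain dictionaries stand in for both the in-process tier and the distributed cache, `TieredCache` is a hypothetical class, and the local tier uses simple first-in-first-out eviction to stay short.

```python
class TieredCache:
    """Checks a small local tier, then a shared tier, then the database."""

    def __init__(self, local_capacity, shared, load_from_db):
        self.local = {}                  # in-process tier, private to this instance
        self.local_capacity = local_capacity
        self.shared = shared             # stand-in for Redis or Memcached
        self.load_from_db = load_from_db
        self.stats = {"local": 0, "shared": 0, "db": 0}

    def get(self, key):
        if key in self.local:
            self.stats["local"] += 1
            return self.local[key]
        if key in self.shared:
            self.stats["shared"] += 1
            value = self.shared[key]
        else:
            self.stats["db"] += 1
            value = self.load_from_db(key)
            self.shared[key] = value     # populate the shared tier on a full miss
        self._remember_locally(key, value)
        return value

    def _remember_locally(self, key, value):
        if len(self.local) >= self.local_capacity:
            oldest = next(iter(self.local))  # dicts preserve insertion order
            del self.local[oldest]
        self.local[key] = value


shared_cache = {}
cache = TieredCache(local_capacity=2, shared=shared_cache,
                    load_from_db=lambda key: f"row-for-{key}")

for key in ["a", "a", "b", "c", "a"]:
    cache.get(key)

print(cache.stats)
```

In this run the last lookup of "a" misses the local tier, because "a" was evicted to make room for "c", but it is still served from the shared tier without touching the database.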

Simple Redis caching in Go

Go
import (
    "context"
    "encoding/json"
    "time"
    "github.com/redis/go-redis/v9"
)

var rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"})

func GetProduct(ctx context.Context, id string) (*Product, error) {
    cacheKey := "product:" + id

    // 1. Try cache first
    cached, err := rdb.Get(ctx, cacheKey).Result()
    if err == nil {
        var p Product
        // Only trust the entry if it decodes; otherwise fall through to the database
        if json.Unmarshal([]byte(cached), &p) == nil {
            return &p, nil  // Cache HIT: ~2ms
        }
    }

    // 2. Cache miss — query database
    p, err := queryProductFromDB(ctx, id)
    if err != nil {
        return nil, err
    }

    // 3. Store in cache with TTL
    data, _ := json.Marshal(p)
    rdb.Set(ctx, cacheKey, data, 10*time.Minute)

    return p, nil
}

Simple Redis caching in Python

Python
import json, redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_product(product_id: str) -> dict:
    cache_key = f"product:{product_id}"

    # 1. Try cache
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)  # Cache HIT

    # 2. Cache miss — query database
    product = query_product_from_db(product_id)

    # 3. Store with 10-minute TTL
    r.setex(cache_key, 600, json.dumps(product))
    return product

Cache Invalidation

There's a famous saying in computer science: "There are only two hard problems in programming: naming things, cache invalidation, and off-by-one errors." Cache invalidation — keeping cached data in sync with the source of truth — is genuinely one of the hardest problems in distributed systems.

Time-based expiration (TTL)

The simplest approach: set a Time To Live (TTL) on each cache entry. After the TTL expires, the entry is automatically deleted. The next request triggers a fresh database query. The challenge is choosing the right TTL — too short and you lose most caching benefit; too long and users see stale data.

Event-based invalidation

Instead of waiting for a timer, you explicitly delete or update the cache entry whenever the underlying data changes. For example, when a user updates their profile, the update handler also deletes the cached profile entry. The next read request will fetch fresh data from the database and repopulate the cache.

The advantage: readers see fresh data right after every write, without waiting for a TTL to expire. The risk: if you forget to invalidate in even one code path that modifies the data, you serve stale results. This requires discipline and thorough code review.
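Both strategies can be combined in one small cache. The sketch below is an in-process illustration, not a Redis client: `TTLCache`, the `profiles` dictionary standing in for the database, and the injectable clock (used so the example does not need to sleep) are all assumptions.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: drop it lazily on read
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl)

    def invalidate(self, key):
        self._entries.pop(key, None)


now = [0.0]                                   # fake clock, advanced by hand
cache = TTLCache(ttl_seconds=600, clock=lambda: now[0])
profiles = {"u1": {"name": "Ada"}}            # stand-in for the database


def get_profile(user_id):
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    value = dict(profiles[user_id])           # cache miss: read the source of truth
    cache.set(user_id, value)
    return value


def update_profile(user_id, name):
    profiles[user_id] = {"name": name}
    cache.invalidate(user_id)                 # event-based invalidation on the write path


first = get_profile("u1")["name"]
profiles["u1"] = {"name": "Grace"}            # a write path that forgot to invalidate
stale = get_profile("u1")["name"]             # still the old value
now[0] = 601.0                                # the TTL has passed
after_ttl = get_profile("u1")["name"]
update_profile("u1", "Linus")                 # a write path that invalidates
after_event = get_profile("u1")["name"]

print(first, stale, after_ttl, after_event)
```

The write that bypassed `update_profile` leaves stale data in place until the TTL runs out, which is why many systems keep a TTL as a safety net even when they invalidate on writes.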

Caching Patterns

The caching pattern determines when and how data enters the cache. Three primary patterns dominate backend engineering:

1. Cache-Aside (Lazy Loading)

The most common pattern. Your application code manages the cache explicitly: check cache → on miss, query DB → store result in cache → return. On write, invalidate or delete the cache entry.

Cache-Aside Flow

Read: Check cache → hit? Return. Miss? Query DB → write to cache → return.
Write: Update DB → delete cache entry (next read repopulates).

2. Write-Through

Every write operation updates both the cache and the database synchronously, as part of the same operation. Reads of recently written data find a fresh copy in the cache, so misses occur only for keys that were never written through or that have since been evicted. The tradeoff: write latency increases because you're writing to two systems before returning a success response.

3. Write-Behind (Write-Back)

The write updates only the cache and returns immediately. The database update happens asynchronously in the background. This gives the fastest write latency but introduces risk: if the cache fails before the async write completes, data is lost. The cache and database may be temporarily inconsistent.

Pattern	Read Speed	Write Speed	Consistency	Complexity
Cache-Aside	Fast (after first miss)	Normal	Good (explicit invalidation)	Low
Write-Through	Always fast	Slower (dual write)	Strong	Medium
Write-Behind	Always fast	Fastest	Eventual (risk of loss)	High
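The difference between the two write paths is easiest to see in code. In this sketch, dictionaries stand in for the cache and the database, the class names are hypothetical, and the write-behind flush is triggered by hand rather than by a background worker. Writing to the database before the cache in the write-through case is one possible ordering, chosen here so a failed database write never leaves a value only in the cache.

```python
from collections import deque


class WriteThroughStore:
    def __init__(self):
        self.cache, self.db = {}, {}

    def write(self, key, value):
        self.db[key] = value      # durable write first...
        self.cache[key] = value   # ...then the cache, before acknowledging

    def read(self, key):
        return self.cache.get(key, self.db.get(key))


class WriteBehindStore:
    def __init__(self):
        self.cache, self.db = {}, {}
        self.pending = deque()    # writes acknowledged but not yet durable

    def write(self, key, value):
        self.cache[key] = value
        self.pending.append((key, value))

    def flush(self):
        """Apply queued writes to the database; return how many were written."""
        count = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.db[key] = value
            count += 1
        return count

    def read(self, key):
        return self.cache.get(key, self.db.get(key))


through = WriteThroughStore()
through.write("price:42", 1999)

behind = WriteBehindStore()
behind.write("price:42", 1999)
db_before_flush = dict(behind.db)   # the database has not seen the write yet
flushed = behind.flush()

print("write-through db:", through.db)
print("write-behind db before flush:", db_before_flush)
print("write-behind db after flush: ", behind.db, f"({flushed} write applied)")
```

Everything sitting in `pending` is exactly the data that would be lost if the cache process died before the flush, which is the risk listed in the table above.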

Cache Hit Rate

The cache hit rate measures what percentage of requests are successfully served from the cache rather than falling through to the database. A 90% hit rate means 9 out of 10 requests are served from cache; only 10% touch the database. A 20% hit rate means your caching layer is barely working.

Cache Hit Rate = Cache Hits / (Cache Hits + Cache Misses) × 100%

Factors that affect hit rate

TTL duration: Longer TTLs keep items in the cache longer, increasing hit rate but risking staleness.
Cache size: A larger cache holds more entries, naturally increasing the probability of a hit.
Data access patterns: If your traffic is highly concentrated on a small set of "hot" items (power-law distribution), caching is extremely effective. If traffic is spread evenly across millions of unique keys, hit rates will be lower.
Eviction policy: Algorithms like LRU (Least Recently Used) and LFU (Least Frequently Used) determine which entries are evicted when the cache is full. Choosing the right one depends on your access patterns.
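A small simulation shows how cache size and access pattern interact. The sketch below replays synthetic traffic against an LRU cache and applies the hit rate formula; the key counts, the 1/rank weighting used to model skewed traffic, and the random seed are illustrative assumptions rather than measurements.

```python
import random
from collections import OrderedDict


def simulate_lru_hit_rate(keys, capacity):
    """Replay key accesses against an LRU cache; return the hit rate in percent."""
    cache = OrderedDict()
    hits = misses = 0
    for key in keys:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            misses += 1
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    total = hits + misses
    return hits / total * 100 if total else 0.0


rng = random.Random(42)
population = list(range(1000))

# Skewed traffic: the key at rank r is requested with weight 1 / (r + 1).
skewed = rng.choices(population, weights=[1 / (r + 1) for r in population], k=20_000)
# Uniform traffic: every key is equally likely.
uniform = rng.choices(population, k=20_000)

skewed_rates = {c: simulate_lru_hit_rate(skewed, c) for c in (10, 100, 500)}
uniform_rates = {c: simulate_lru_hit_rate(uniform, c) for c in (10, 100, 500)}

for capacity in (10, 100, 500):
    print(f"capacity {capacity:>3}: skewed {skewed_rates[capacity]:5.1f}%  "
          f"uniform {uniform_rates[capacity]:5.1f}%")
```

With the same cache size, the skewed workload gets a much higher hit rate than the uniform one, because a small cache can hold most of the keys that matter.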

Target hit rates

For most web applications, a 70–95% hit rate is considered healthy. Below 50% suggests a fundamental mismatch between your caching strategy and your actual access patterns — time to re-examine your TTLs, cache keys, and eviction policy.

Vertical Scaling

Vertical scaling (also called "scaling up") means replacing your server with a more powerful machine — more CPU cores, more RAM, faster storage, better network cards. Your code doesn't change. Your architecture doesn't change. You simply upgrade the hardware.

Why it's appealing

Vertical scaling is the simplest form of scaling. A server with 2× the CPU cores handles roughly 2× the requests, provided the workload parallelizes well. A server with 2× the RAM can hold 2× the cached data. You avoid the entire category of distributed systems complexity: no load balancers, no state synchronization, no network partition handling. Economically, one machine with 2× the capacity often costs less than running two smaller machines once you account for the operational overhead of managing a fleet.

Hard limits of vertical scaling

Hardware ceiling: Cloud providers offer machines up to a certain size. Once you've maxed out the largest available instance (e.g., 96 cores, 384 GB RAM), you simply cannot scale further.
Single point of failure: One machine means one failure domain. If it crashes, your entire service goes offline.
No geographic distribution: A single server in one data center means users on the other side of the world experience high latency, because physics imposes a speed-of-light floor on cross-continental round trips (~100ms US↔India).

Fig 8 · Vertical scaling — bigger machines, same architecture, hard ceiling

Horizontal Scaling

Horizontal scaling (also called "scaling out") takes the opposite approach: instead of making one machine bigger, you add more machines of similar size that work together to handle the load.

Advantages

No hard ceiling: Need more capacity? Add more servers. No single machine's hardware caps how many instances you can run.
Redundancy: If one server fails, the others continue serving traffic; the failure is absorbed rather than catastrophic.
Geographic distribution: You can place servers in multiple data centers around the world, routing users to the nearest one for minimal latency.

The complexity cost

Horizontal scaling introduces an entirely new category of problems — the problems of distributed systems:

Request distribution: How do you decide which server handles each request? You need a load balancer, a new component in your infrastructure.
State synchronization: If user A updates their profile on Server 1, how does Server 2 know about it?
Network partitions: What happens when the network connecting your servers fails? Servers may make conflicting decisions.
Health checking: How do you detect that a server is down, reroute traffic away from it, and send traffic back once it recovers?

All these problems have solutions — load balancers, shared databases, consensus algorithms, health checks — but each solution adds complexity and requires careful engineering. Distributed systems don't eliminate problems; they transform one set of problems into a different set, and you must evaluate which set of problems is more manageable for your situation.

Vertical Scaling

Pros:
Simple: no architectural changes
No distributed systems complexity
Economically efficient at small scale
Cons:
Hard hardware ceiling
Single point of failure
No geographic distribution

Horizontal Scaling

Pros:
No capacity ceiling
Built-in redundancy
Geographic distribution possible
Cons:
Requires load balancing
State synchronization needed
Operationally complex

Load Balancing

The moment you have more than one server, you need a mechanism to distribute incoming requests across them. That mechanism is the load balancer. It sits in front of your servers, receives all incoming traffic, and forwards each request to one of the available backend instances based on a chosen algorithm.

Common algorithms

Round Robin: Requests are distributed sequentially: server 1, server 2, server 3, repeat. Simple, and even when requests cost roughly the same.
Least Connections: Each request goes to the server currently handling the fewest active connections. Better for requests with varying duration.
IP Hash: A hash of the client's IP address determines which server receives the request, so the same client keeps hitting the same server as long as the server set doesn't change (useful for session affinity).
Weighted Round Robin: Servers with more capacity receive proportionally more requests.
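The four selection strategies can be sketched in a few lines each. The class names and the `pick` / `release` methods below are hypothetical, chosen for illustration; production load balancers implement more refined variants (for example, smoother weighted rotation), so treat this as a model of the ideas rather than of any specific product.

```python
import hashlib
import itertools


class RoundRobin:
    """Hand out servers in a fixed repeating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self, client_ip=None):
        return next(self._cycle)


class WeightedRoundRobin:
    """Naive weighted variant: each server appears `weight` times per rotation."""

    def __init__(self, weights):  # e.g. {"big": 3, "small": 1}
        expanded = [s for s, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self, client_ip=None):
        return next(self._cycle)


class LeastConnections:
    """Send each request to the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self, client_ip=None):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1  # caller must call release() when the request ends
        return server

    def release(self, server):
        self.active[server] -= 1


class IPHash:
    """Map a client IP to a server deterministically via a hash."""

    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self, client_ip):
        digest = hashlib.sha256(client_ip.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.servers)
        return self.servers[index]


lb = LeastConnections(["s1", "s2", "s3"])
first = lb.pick()   # all idle, so the first server is chosen
second = lb.pick()  # first server now has one active connection
lb.release(first)   # the first request finishes
print(first, second, lb.pick())
```

Note that the simple modulo in `IPHash` remaps most clients whenever a server is added or removed, which is why affinity only holds while the server set is stable.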

Fig 9 · Load balancer distributes traffic across geographically distributed servers

Load balancer in practice (Nginx config)

Nginx
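upstream backend {
    # Least connections algorithm (server weights are taken into account)
    least_conn;

    server 10.0.1.10:8080 weight=3;   # stronger machine
    server 10.0.1.11:8080 weight=2;
    server 10.0.1.12:8080 weight=2;
    server 10.0.1.13:8080 backup;     # only used if others are down
}

server {
    listen 443 ssl;
    server_name api.example.com;

    # "listen ... ssl" requires a certificate and key; these paths are placeholders
    ssl_certificate     /etc/nginx/certs/api.example.com.crt;
    ssl_certificate_key /etc/nginx/certs/api.example.com.key;

    location / {
        proxy_pass http://backend;
        # Pass the original host and client address through to the backends
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}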

Health checks

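Load balancers periodically send health check probes (typically HTTP requests to a /health endpoint) to each backend. If a server fails to respond, or responds with an error status, the load balancer stops routing traffic to it; once it passes its checks again, traffic resumes. This is the mechanism that provides the redundancy benefit of horizontal scaling. Note that the open-source Nginx used in the configuration above checks health passively, by watching real requests fail (tunable with max_fails and fail_timeout on each server line); active probing via the health_check directive is an NGINX Plus feature.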
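The probing side of this mechanism can be modeled as a small state machine. The sketch below is an assumption-laden illustration: the `HealthChecker` class, the `fail_threshold` and `rise_threshold` parameters, and the `http_probe` helper are hypothetical names, and requiring several consecutive results before changing state is a common design choice rather than something the text above prescribes.

```python
import urllib.request


def http_probe(backend, timeout=2.0):
    """Return True when GET http://<backend>/health answers with a 2xx status."""
    with urllib.request.urlopen(f"http://{backend}/health", timeout=timeout) as resp:
        return 200 <= resp.status < 300


class HealthChecker:
    """Tracks backend health from probe results (illustrative sketch)."""

    def __init__(self, backends, probe=http_probe, fail_threshold=3, rise_threshold=2):
        self.probe = probe                    # callable: backend -> bool
        self.fail_threshold = fail_threshold  # consecutive failures before removal
        self.rise_threshold = rise_threshold  # consecutive successes before return
        self.healthy = {b: True for b in backends}
        self._streak = {b: 0 for b in backends}  # results contradicting current state

    def run_once(self):
        """Probe every backend once; call this on a timer (for example every 5 s)."""
        for backend, is_healthy in list(self.healthy.items()):
            ok = self._safe_probe(backend)
            if ok == is_healthy:
                self._streak[backend] = 0  # state confirmed, reset the counter
                continue
            self._streak[backend] += 1
            needed = self.rise_threshold if ok else self.fail_threshold
            if self._streak[backend] >= needed:
                self.healthy[backend] = ok
                self._streak[backend] = 0

    def _safe_probe(self, backend):
        try:
            return bool(self.probe(backend))
        except Exception:
            return False  # timeouts, refused connections, and HTTP errors count as failures

    def available(self):
        """Backends that should currently receive traffic."""
        return [b for b, ok in self.healthy.items() if ok]
```

Requiring a streak in both directions keeps one dropped packet from ejecting a healthy server and keeps a server that is still flapping from rejoining too early.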

Simple health endpoint in Go

Go
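import (
    "context"
    "database/sql"
    "encoding/json"
    "net/http"
    "time"
)

// healthHandler reports 200 when the database answers a ping and 503 otherwise.
// db is assumed to be a *sql.DB opened elsewhere with your driver of choice.
func healthHandler(db *sql.DB) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Bound the check so a hung database cannot hang the probe itself.
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        defer cancel()

        w.Header().Set("Content-Type", "application/json")

        // Check database connectivity
        if err := db.PingContext(ctx); err != nil {
            w.WriteHeader(http.StatusServiceUnavailable)
            json.NewEncoder(w).Encode(map[string]string{
                "status": "unhealthy",
                "reason": "database unreachable",
            })
            return
        }
        w.WriteHeader(http.StatusOK)
        json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
    }
}

// Register it with: http.Handle("/health", healthHandler(db))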

Summary & Key Principles

Performance engineering is not about memorizing a checklist of techniques. It's about building mental models for how systems behave under load and applying the right tool at the right time. Here are the core principles:

1. Measure, don't guess. Never assume where the bottleneck is. Use timing middleware, distributed tracing, profilers, and EXPLAIN ANALYZE to identify the actual problem before investing in a solution.

2. Think in percentiles, not averages. P50 tells you the typical experience. P99 tells you what the slowest 1% of requests experience, and those requests often come from your most valuable users doing your most complex operations.

3. Respect the utilization curve. Latency rises sharply and non-linearly as utilization approaches 100%. Keep headroom for traffic bursts. Target 60–80% utilization in production.

4. Fix the database first. Eliminate N+1 queries, add appropriate indexes (but not too many), and use connection pooling. These low-hanging optimizations often provide the biggest gains.

5. Cache strategically. Understand your access patterns. Choose the right caching pattern (cache-aside, write-through, write-behind). Monitor your cache hit rate. Remember that cache invalidation is genuinely hard — respect it.

6. Scale appropriately. Start with vertical scaling — it's simpler and cheaper. Move to horizontal scaling when you hit hardware limits, need redundancy, or need geographic distribution. Understand that horizontal scaling transforms your problems rather than eliminating them.

